SYSTEM AND METHOD FOR GESTURE INTERFACE
BACKGROUND OF THE INVENTION 

1. Field of the Invention:

The present invention relates to computer interfaces, and 
more particularly to a real-time gesture interface for use in 
medical visualization workstations. 

2. Discussion of the Prior Art:

In many environments, traditional hands-on user
interfaces, for example, a mouse and keyboard, for interacting
with a computer are not practical. One example of such an
environment is an operating theater (OT), where there is a need
for strict sterility. A surgeon, and everything coming into
contact with his/her hands, must be sterile. Therefore, the
mouse and keyboard may be excluded from consideration as an
interface because they may not be sterilized.

A computer may be used in the OT for medical imaging. The 
interaction can include commands to display different images, 
scrolling through a set of two-dimensional (2D) images, 
changing imaging parameters (window/level), etc. With advances 
in technology, there is a growing demand for three-dimensional 
(3D) visualizations. The interaction and manipulation of 3D 
models is intrinsically more complicated than for 2D models 
even if a mouse and keyboard can be used, because the commands 
may not be intuitive when working in 3D. Examples of commands 
in a 3D medical data visualization environment include
rotations and translations including zoom. 

Areas of human-machine interaction in the OT include, for
example, voice recognition and gesture recognition. There are
several commercially available voice recognition systems. In
the context of the OT, their advantage is that the surgeon can
continue an activity, for example, a suture, while commanding
the imaging system. However, the disadvantage is that the
surgeon needs to mentally translate geometric information into
language: e.g., "turn right", "zoom in", "stop". These
commands need to include some type of qualitative information.
Therefore, it can be complicated and tiresome to achieve a
specific 3D orientation. Other problems related to voice
recognition are that it may fail in a noisy environment, and
the system may need to be trained to each user.

Researchers have attempted to develop systems that can 
provide a natural, intuitive human-machine interface. Efforts 
have been focused on the development of interfaces without 
mouse or device based interactions. In the OT, the need for 
sterility warrants the use of novel schemes for human-machine 
interfaces for the doctor to issue commands to a medical
imaging workstation. 

Gesture recognition includes two sequential tasks:
feature detection/extraction and pattern
recognition/classification. A review of visual interpretation
of hand gestures can be found in V.I. Pavlovic, R. Sharma, and
T.S. Huang, "Visual interpretation of hand gestures for
human-computer interaction: A review", IEEE Transactions on Pattern
Analysis and Machine Intelligence, 19(7):677-695, July 1997.

For feature detection/extraction, applications may use
color to detect human skin. An advantage of a color-based
technique is real-time performance. However, the variability
of skin color in varying lighting conditions can lead to false
detection. Some applications use motion to localize the
gesture. A drawback of a motion cue approach is that
assumptions may be needed to make the system operable, e.g., a
stationary background and one active gesturer. Other methods,
such as using data-gloves/sensors to collect 3D data, may not
be suitable for a human-machine interface because they are not
natural.

For pattern recognition and classification, several 
techniques have been proposed. The Hidden Markov Model (HMM) is
one method. HMMs can be used, for example, for the recognition
of American Sign Language (ASL). One approach uses
motion-energy images (MEI) and motion-history images (MHI) to
recognize gestural actions. Computational simplicity is the 
main advantage of such a temporal template approach. However, 
motion of unrelated objects may be present in MHI. 

Neural networks are another tool used for recognition. In 
particular, a time-delay neural network (TDNN) has 
demonstrated the capability to classify spatio-temporal 
signals. TDNN can also be used for hand gesture recognition. 
However, TDNN may not be suitable for some environments such
as an OT, wherein the background can include elements 
contributing to clutter. 

Therefore, a need exists for a system and method for a 
real-time interface for medical workstations. 

SUMMARY OF THE INVENTION 

According to an embodiment of the present invention, a 
method is provided for determining a gesture. The method 
includes determining a change in a background of an image from 
a plurality of images, and determining an object in the 
image. The method further includes determining a trajectory of 
the object through the plurality of images, and classifying a 
gesture according to the trajectory of the object. 

Determining the change in the background includes 
determining a gradient intensity map for the background from a 
plurality of images, determining a gradient intensity map for 
the current image, and determining, for a plurality of pixels, 
a difference between the gradient intensity map and the 
gradient intensity map for the background. Determining the 
change in the background further includes determining a 
comparison between the difference and a threshold, and 
determining a pixel to be a background pixel according to the 
comparison. 

The object includes a user's hand. 



Determining the object in the image includes obtaining a
normalized color representation for a plurality of colors in
each image, determining from training images an estimate of a
probability distribution of normalized color values for an
object class, and determining, for each pixel, a likelihood
according to an estimated probability density of normalized
color values for the object class.

Determining the trajectory of the object through the
plurality of images further comprises determining, for each
pixel, a temporal likelihood across a plurality of images, and
determining a plurality of moments according to the temporal
likelihoods.

Determining the trajectory includes determining a difference
in a size of the object over a pre-determined time period,
determining a plurality of angles between a plurality of lines
connecting successive centroids over the time period, and
determining a feature vector according to the angles and
lines.

The method further includes classifying the feature vector
according to a time-delay neural network, wherein a feature is
of a fixed length.

Classifying the gesture includes determining a reference
point, determining a correspondence between the trajectory and
the reference point, and classifying the trajectory according
to one of a plurality of commands.

According to an embodiment of the present invention, a
method is provided for determining a trajectory of a hand 
through a plurality of images. The method includes detecting a 
reference point, updating the reference point as the reference 
point is varied, and detecting a first translation of the hand 
through the plurality of images. The method further includes 
detecting a second translation through the plurality of 
images, determining a gesture according to a vote, and
determining whether the gesture is a valid gesture command. 

The reference point is not interpreted as a gesture 
command. The reference point is characterized by hand size and 
a location of a centroid of the hand in each image. 

The first translation is one of a forward and a backward 
translation, wherein the first translation is characterized by 
a large change in hand size and a relatively small change in a 
centroid of the hand. The second translation is one of a left,
a right, an up and a down translation. 

Detecting the second translation includes determining a
normalized vector between two centroids $C_t$ and $C_{t-1}$ as a feature
vector, wherein there are three output patterns. The three
output patterns are a vertical movement, a horizontal
movement, and an unknown. The method further includes
comparing the reference point to a centroid upon determining
the translation to be a vertical or a horizontal translation,
and testing an input pattern upon determining the translation
to be an unknown translation. Testing an input pattern further
comprises detecting a circular movement, wherein an angle
between vector $C_tC_{t-1}$ and vector $C_{t-1}C_{t-2}$ is determined as the
feature vector.

The valid gesture is performed continually for a 
predetermined time.

According to an embodiment of the present invention, a 
program storage device is provided readable by machine, 
tangibly embodying a program of instructions executable by the 
machine to perform method steps for determining a gesture. The
method includes determining a change in a background of an 
image from a plurality of images, determining an object in the 
image, determining a trajectory of the object through the 
plurality of images, and classifying a gesture according to 
the trajectory of the object. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Preferred embodiments of the present invention will be 
described below in more detail, with reference to the 
accompanying drawings:

Fig. 1 is a screenshot of the Fly-Through visualization
tool according to an embodiment of the present invention; 

Fig. 2 is an image showing a user's operating hand in an 
image according to an embodiment of the present invention; 

Fig. 3 shows modules of the gesture interface for medical 
workstations according to an embodiment of the present 
invention; 



Fig. 4 shows a hierarchy of a TDNN-based classifier
according to an embodiment of the present invention;

Figs. 5a-d show an example of a method of discriminating
movements according to an embodiment of the present invention; 
and 

Figs. 6a-h show an example of a method of determining a 
hand gesture wherein the hand is not held stationary according 
to an embodiment of the present invention. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

A system and method for a computer interface detects 
changes in a background portion of an image, classifies an 
object of interest based on color properties in the image, and 
extracts and classifies a gesture feature. The resulting 
classification results can be used to control a 3D 
visualization system for medical image data, for example,
Fly-Through. This system and method can achieve real-time
performance in cluttered background settings. Further, the 
system and method can be implemented in conjunction with a 
medical image visualization system or method. 

3D Virtuoso is a postprocessing workstation from Siemens 
that has many 3D tools. One of these tools, Fly-Through, is a 
dedicated tool for Virtual Endoscopy Simulation. Besides 
generic 3D rendering capabilities, it has a viewpoint that 
shows a view of a cavity, for example, a trachea or colon, 
from a viewpoint inside the body, the virtual endoscope.
Fig. 1 is a screenshot of a visualization tool, in this case,
Fly-Through, showing a global view of the data 101 as well as a
virtual endoscope view 102 from a user-defined vantage point.

According to an embodiment of the present invention, the 
system and method can imitate the manipulation of an 
endoscope. The system and method allow the user to, for 
example, push, pull, pivot and turn a virtual endoscope. These
and other commands can be issued through gesture recognition. Gestures
can include, for example, degrees of translations including 
left, right, up, down, forward, and backward, and circular 
movements including clockwise and counterclockwise. Circular 
movements are viewed as rotations in the gesture interface. As 
Fig. 2 shows, a camera is fixed in front of a user's hand 201. 
A valid gesture command needs to be performed continually for 
a predetermined time to initialize the command. Repetition of 
a gesture, e.g., more than two times, can be considered as a 
valid command. For example, to drive the virtual endoscope to 
the left, the user may wave his hand from right to left, from 
left to right, and continue this movement until the virtual 
endoscope moves to the desired position. Thus, a high 
recognition rate, e.g., 95%, using hand gestures can be 
obtained. 

The design of gestures can be important to a gesture 
interface. It may not be reasonable to ask a user to keep 
his/her hand in the visual field of the camera at all times. 
Also, meaningless hand movements need to be disregarded by the
human-machine interface (HMI). For example, after performing a
gesture, the user may want to move his/her hand out of the 
camera's field of view to do other operations, e.g., to make 
an incision. These kinds of hand movements are allowed and the 
HMI needs to ignore them. After the user initializes a valid 
gesture command, the system executes the command so long as 
the gesture continues. For example, the longer a gesture is 
performed, the larger movement the virtual endoscope makes in 
the case of Fly-Through. 

Consider two valid gesture commands, move left and move
right. Both commands may need the user's hand to be waved
horizontally, and the user can continue this movement as many
times as desired. Given no information about where the
movement starts, there may be no way to distinguish between the
motion trajectory patterns, e.g., left or right waves. Similar
ambiguities can occur when other translations are performed.
For this reason, the system and method need to know or
determine a starting point for a gesture command. According to
an embodiment of the present invention, by holding the hand
stationary before performing a new gesture, the stationary
point becomes a reference point. The reference point is used
to distinguish among, for example, moving left or right, up or
down, and forward or backward.

A gesture command can include various gestures, for 
example, using the representation of circular movements of a 
finger or rotating the hand to cause the view to rotate. In 
this example, drawing circles may be easier for the user than
rotating the hand. 

Referring to Fig. 3, the method includes detecting 
changes in the background of a video image in a sequence 301. 
The method can detect the skin tone of a user according to a
Gaussian mixture model 302. A motion trajectory of, for
example, the user's hand, can be extracted from the video 
sequence 303. TDNN based motion pattern classification 304 can 
be used to classify a hand gesture. The system sends the 
classification results to, for example, the Fly-Through 
visualization system. 

The system and method can detect changes in a background
by determining an intensity of each image from a video stream.
To eliminate noise, a Gaussian filter can be applied to each
image. A gradient map of pixel intensity can be determined.
After determining the gradient map of a current image frame,
the gradient map is compared with the learned background
gradient map. If a given pixel differs by less than a threshold
between these two gradient maps, the pixel is determined to be
a background pixel, and can be marked accordingly. A
predetermined threshold can be used. One of ordinary skill in
the art would appreciate, in light of the present invention,
that additional methods for selecting the threshold exist, for
example, through knowledge of sensor characteristics or
through normal illumination changes allowed in the background.
According to an embodiment of the present invention, the
largest area of connected background pixels can be treated as
the background region.
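
By way of illustration only, the background test described above can be sketched in Python with NumPy. The function names, the zero-padded separable blur, the use of the gradient magnitude as the "gradient map", and the averaging of several background frames to obtain the learned map are assumptions made for this sketch and are not taken from the specification. Selection of the largest connected background region is not shown.

```python
import numpy as np

def gaussian_smooth(gray, sigma=1.0):
    """Separable Gaussian blur with zero padding.

    Assumes each image side is at least as long as the kernel.
    """
    radius = max(1, int(3 * sigma))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-offsets**2 / (2.0 * sigma**2))
    kernel /= kernel.sum()
    blur = lambda v: np.convolve(v, kernel, mode="same")
    rows = np.apply_along_axis(blur, 1, gray.astype(float))
    return np.apply_along_axis(blur, 0, rows)

def gradient_map(gray, sigma=1.0):
    """Gradient magnitude of the smoothed intensity image."""
    gy, gx = np.gradient(gaussian_smooth(gray, sigma))
    return np.hypot(gx, gy)

def learn_background(frames, sigma=1.0):
    """Hypothetical learning step: average the gradient maps of background frames."""
    return np.mean([gradient_map(f, sigma) for f in frames], axis=0)

def background_mask(frame, background_gradient, threshold, sigma=1.0):
    """True where the gradient differs from the learned map by less than the threshold."""
    diff = np.abs(gradient_map(frame, sigma) - background_gradient)
    return diff < threshold
```

Because the comparison is made on gradients rather than raw intensities, a uniform change in brightness leaves the mask largely unaffected, while the uniform interior of a new object may still be marked as background; only its edges are reliably flagged.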

According to an embodiment of the present invention, 
skin-tone detection can be based on a normalized color model
using a learned mixture of Gaussian distributions. The use of
normalized colors,

\[ \left( \frac{r}{r+g+b},\ \frac{g}{r+g+b} \right), \]

can reduce the variance of
skin color in an image. Also, it has been shown that skin
color can be modeled by a multivariate Gaussian in HS (hue and 
saturation) space under certain lighting conditions. In 
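general, for a Gaussian mixture model with n components, the
conditional probability density for an observation x of
dimensionality d is:

\[ p(x \mid \theta) = \sum_{i=1}^{n} \pi_i \, \frac{\exp\left(-\tfrac{1}{2}(x-\mu_i)^{T}\Sigma_i^{-1}(x-\mu_i)\right)}{(2\pi)^{d/2}\,\lvert\Sigma_i\rvert^{1/2}} \tag{1} \]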

where the mixing parameter $\pi_i$ corresponds to the prior probability
of mixture component i and each component is a Gaussian with
mean vector $\mu_i$ and covariance matrix $\Sigma_i$. According to an
embodiment of the present invention, skin colors can be
modeled in the normalized RG (red and green) space. With
learned mean vectors $\mu$, covariance matrices $\Sigma$, and known priors
$\pi$, a likelihood is determined for each pixel of the image
according to Equation (1) above. According to one embodiment
of the present invention, the likelihood of a pixel $I(x, y)$
can be defined as:

\[ L(x, y) = \begin{cases} p(x \mid \theta) & \text{if } I(x, y) \in \text{foreground pixels;} \\ 0 & \text{otherwise.} \end{cases} \tag{2} \]



For a foreground pixel with its normalized color
observation x, the likelihood of the pixel is defined as its
estimated density. For background pixels, the likelihood
values are set to 0. A possible method to select skin pixels
is to apply a simple threshold to Equation (2). If the
likelihood of a pixel is larger than the threshold, the pixel
is then classified as a skin pixel. The largest skin area
of the image is often viewed as the detected skin object.
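
As a minimal sketch of the per-pixel skin likelihood described above, the following Python code evaluates a Gaussian mixture density on normalized (r, g) colors and sets the likelihood of background pixels to 0. The function names and the handling of black pixels (where r + g + b = 0) are assumptions of this sketch; the mixture parameters are assumed to have been learned elsewhere from training images.

```python
import numpy as np

def normalized_rg(image_rgb):
    """Map an H x W x 3 RGB image to normalized (r, g) = (R, G) / (R + G + B)."""
    rgb = image_rgb.astype(float)
    total = rgb.sum(axis=-1, keepdims=True)
    total[total == 0] = 1.0  # assumption: black pixels map to (0, 0)
    return (rgb / total)[..., :2]

def gmm_density(x, weights, means, covs):
    """Gaussian mixture density evaluated at points x of shape (..., d)."""
    d = x.shape[-1]
    density = np.zeros(x.shape[:-1])
    for w, mu, cov in zip(weights, means, covs):
        diff = x - mu
        # Squared Mahalanobis distance for every pixel at once.
        maha = np.einsum("...i,ij,...j->...", diff, np.linalg.inv(cov), diff)
        norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
        density += w * np.exp(-0.5 * maha) / norm
    return density

def skin_likelihood(image_rgb, foreground, weights, means, covs):
    """Mixture density for foreground pixels, 0 for background pixels."""
    density = gmm_density(normalized_rg(image_rgb), weights, means, covs)
    return np.where(foreground, density, 0.0)
```

Here `foreground` is a boolean mask, for example the complement of a background mask, with the same height and width as the image.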

The trajectory of the centroid of the detected skin
object is often used as the motion trajectory of the object.
However, it has been determined that there are many objects
having skin-like color in an office environment. For example,
a wooden bookshelf or a poster on a wall may be misclassified
as a skin-like object. Therefore, the system and method
attempt to eliminate background pixels as discussed above.
In addition, the skin objects (the user's hand and probably the arm)
are sometimes split up into two or more blobs. Other skin
regions, such as the face, may also appear in the view of the
camera. These problems, together with non-uniform illumination,
make the centroid vary dramatically and lead to false
detections. For these reasons, a stable motion trajectory is
hard to obtain by just finding the largest skin area. To
handle these problems, a temporal likelihood $L^{t}(x, y, t)$
of each pixel $I(x, y)$ can be defined as:

\[ L^{t}(x, y, t) = \lambda L(x, y) + (1 - \lambda)\, L^{t}(x, y, t-1) \tag{3} \]

where $\lambda$ is a decay factor. Experiments show that a value of $\lambda$
equal to 0.5 can be used.
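
The recursive update of the temporal likelihood can be sketched as follows. The function names are hypothetical, and starting the recursion from an all-zero map is an assumption of this sketch, since the specification does not state an initial value.

```python
import numpy as np

def update_temporal_likelihood(previous, current, decay=0.5):
    """One step: decay * L(x, y) + (1 - decay) * previous temporal likelihood."""
    current = np.asarray(current, dtype=float)
    previous = np.asarray(previous, dtype=float)
    return decay * current + (1.0 - decay) * previous

def run_temporal_filter(likelihood_frames, decay=0.5):
    """Apply the update over a sequence of per-frame likelihood maps."""
    first = np.asarray(likelihood_frames[0], dtype=float)
    temporal = np.zeros_like(first)  # assumption: recursion starts at zero
    for frame in likelihood_frames:
        temporal = update_temporal_likelihood(temporal, frame, decay)
    return temporal
```

With a decay factor of 0.5, a pixel that is skin-like in only a single frame contributes half of its likelihood at that frame and a quarter one frame later, so isolated flickers are attenuated while persistently skin-like pixels approach their full likelihood.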

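To select skin pixels, a threshold $\delta$ is applied to the
temporal likelihood $L^{t}(x, y, t)$ instead of the likelihood $L(x, y)$
of each pixel. Thus, the thresholded temporal likelihood of a
pixel can be defined as:

\[ L^{t}_{\delta}(x, y, t) = \begin{cases} L^{t}(x, y, t) & \text{if } L^{t}(x, y, t) > \delta; \\ 0 & \text{otherwise.} \end{cases} \tag{4} \]

The moments of the image can be determined as follows:

\[ M^{t}_{00} = \iint L^{t}_{\delta}(x, y, t)\, dx\, dy \tag{5} \]

\[ M^{t}_{10} = \frac{\iint x\, L^{t}_{\delta}(x, y, t)\, dx\, dy}{M^{t}_{00}} \tag{6} \]

\[ M^{t}_{01} = \frac{\iint y\, L^{t}_{\delta}(x, y, t)\, dx\, dy}{M^{t}_{00}} \tag{7} \]

According to an embodiment of the present invention, $M^{t}_{00}$
is viewed as the size of skin pixels, and $(M^{t}_{10}, M^{t}_{01})$ is taken
to form the motion trajectory. The present invention precisely
classifies the user gesture. The system and method provide a
reasonable solution to the extraction of trajectories of hand
motions.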
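
A discrete version of the thresholded moments can be sketched as below, with the integrals replaced by sums over pixels. The function name and the convention that x is the column index and y the row index are assumptions of this sketch.

```python
import numpy as np

def skin_moments(temporal_likelihood, delta):
    """Return (size, centroid) from the thresholded temporal likelihood.

    size     : sum of the thresholded likelihood (zeroth moment)
    centroid : (x, y) first moments divided by the size, or None if size is 0
    """
    temporal_likelihood = np.asarray(temporal_likelihood, dtype=float)
    kept = np.where(temporal_likelihood > delta, temporal_likelihood, 0.0)
    size = float(kept.sum())
    if size == 0.0:
        return 0.0, None
    ys, xs = np.indices(kept.shape)  # row index = y, column index = x
    cx = float((xs * kept).sum() / size)
    cy = float((ys * kept).sum() / size)
    return size, (cx, cy)
```

Collecting the centroid returned for each frame yields the motion trajectory, and the size value is the quantity later compared against the reference size.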

Recognition of a user's hand motion patterns can be
accomplished using TDNN according to an embodiment of the
present invention. Experiments show that TDNN has good
performance on motion pattern classification. As shown by
experiments, TDNN has better performance if the number of
output labels is kept small. Another advantage is that a small
number of output labels makes the networks simple and saves time at
the network training stage. For these reasons, the user's gestures are
tested hierarchically. Further, TDNN applied hierarchically
has been determined to be suitable for the classification of
the eight motion patterns described above. For instance, left
movement and right movement have the common motion pattern of 
horizontal hand movement. Thus, once horizontal movement is 
detected, the range of the motion is compared with the 
reference point to differentiate these two gestures. 

Without introducing the reference point, the neural 
network has difficulty in discriminating the gestures. The 
input patterns of the TDNNs have a fixed input length. Since 
classification is to be performed in real-time as the user 
moves his hand, the motion patterns are classified along 
windows in time. At time t, the centroid $C_t$ is obtained as
described with respect to motion trajectory extraction. 

Suppose the length of an input pattern is w. The feature
vectors $\{v_{t-w+1}, v_{t-w+2}, \ldots, v_t\}$ from $\{C_{t-w}, C_{t-w+1}, \ldots, C_t\}$ are
extracted to form a TDNN input pattern. When the maximum
response from the network is relatively small, as compared
with other label responses, the input pattern is classified as
an unknown. Some false detections or unknowns are inevitable.
False detection can occur when the trajectory of a translation
is similar to an arc of a circle. To minimize false detection
and obtain stable performance, a fixed number of past results
are checked. When more than half of these past results
indicate the same output pattern, this output pattern is
determined to be a final result. This method has been used to
successfully obtain a reliable recognition rate.
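
The rejection of weak responses and the check over past results can be sketched as follows. The network itself is not shown. The names `label_from_responses` and `MajorityVoter`, the margin test used to decide that the maximum response is "relatively small", and the default values are assumptions of this sketch rather than details from the specification.

```python
from collections import Counter, deque

def label_from_responses(responses, min_margin=0.2):
    """Pick the label with the largest response, or "unknown" if it barely wins.

    `responses` maps at least two labels to network outputs. The margin rule
    is a hypothetical reading of "relatively small".
    """
    ranked = sorted(responses.items(), key=lambda item: item[1], reverse=True)
    (best, top), (_, runner_up) = ranked[0], ranked[1]
    return best if top - runner_up >= min_margin else "unknown"

class MajorityVoter:
    """Keep a fixed number of past results and report a strict majority."""

    def __init__(self, history=5):
        self.results = deque(maxlen=history)

    def update(self, label):
        """Add one result; return the majority label or None if there is none."""
        self.results.append(label)
        winner, count = Counter(self.results).most_common(1)[0]
        if count > self.results.maxlen / 2:
            return winner
        return None
```

A typical use would call `label_from_responses` once per time window and feed each label to `MajorityVoter.update`, acting only when the returned value is not None.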

Fig. 4 shows a hierarchy of the motion pattern classifier 
according to an embodiment of the present invention. For the 
detection of a reference point, when a user keeps his/her hand
stationary 401 for a period of time, that is, both size and
centroid are almost the same along some time interval, the
method detects and updates a reference point 402. The reference
point will not be interpreted as a gesture command by the
system and method.
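
One possible way to test whether the hand is stationary over a time interval, and to update the reference point accordingly, is sketched below. The tolerances, the function names, and the representation of the reference point as a dictionary are hypothetical choices for illustration only.

```python
import numpy as np

def is_stationary(sizes, centroids, size_tolerance=0.1, max_shift=5.0):
    """True if size and centroid stay almost the same over the interval.

    sizes     : hand sizes over the interval
    centroids : (x, y) centroids over the same interval
    """
    sizes = np.asarray(sizes, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    mean_size = sizes.mean()
    if mean_size <= 0:
        return False
    size_ok = np.all(np.abs(sizes - mean_size) <= size_tolerance * mean_size)
    shifts = np.linalg.norm(centroids - centroids.mean(axis=0), axis=1)
    return bool(size_ok and np.all(shifts <= max_shift))

def update_reference(sizes, centroids, reference=None, **tolerances):
    """Return a new reference point if the hand is stationary, else the old one."""
    if is_stationary(sizes, centroids, **tolerances):
        return {
            "size": float(np.mean(sizes)),
            "centroid": tuple(np.mean(centroids, axis=0)),
        }
    return reference
```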

The method detects forward/backward translations 403. The 
skin size information obtained from Equation (5) can be used 
to determine a translation. Since the movement of forward or 
backward is roughly along the Z-axis of camera, these two 
translations are characterized by a dramatic change of skin 
size and subtle change of the centroid of the detected skin 
region. The estimated size of the hand is compared to the size 
of the hand when the reference point was initialized to 
differentiate between a forward and a backward movement. 
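
The forward/backward test can be illustrated with the short sketch below. The thresholds are hypothetical, and the mapping of a growing hand to "forward" (the hand approaching the camera) is an assumption of this sketch; the specification only states that the current size is compared with the size at the reference point.

```python
import math

def classify_depth_translation(size, centroid, ref_size, ref_centroid,
                               size_ratio=1.3, max_shift=10.0):
    """Return "forward", "backward", or None.

    A depth translation is assumed when the size changes by more than
    `size_ratio` while the centroid moves by at most `max_shift` pixels.
    """
    shift = math.hypot(centroid[0] - ref_centroid[0],
                       centroid[1] - ref_centroid[1])
    if shift > max_shift:
        return None  # centroid moved too much for a pure depth translation
    if size > ref_size * size_ratio:
        return "forward"   # assumption: larger hand = closer to the camera
    if size < ref_size / size_ratio:
        return "backward"
    return None
```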

Further, the method can detect left/right/up/down
translations 405. The normalized vector between centroids $C_t$
and $C_{t-1}$ is computed as the feature vector. There are three
output patterns: vertical movement, horizontal movement, and
unknown. To determine the direction of a vertical or
horizontal movement, the centroid of the reference point is compared to
the centroid currently estimated in the frame. If the result
is unknown, e.g., it may be a circular movement, the input
pattern is tested at the next stage.
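
The feature computation and the comparison with the reference point can be sketched as follows. The classification into horizontal, vertical, or unknown is performed by the network and is not shown. The function names are hypothetical, and the sign conventions (image x increasing to the right, image y increasing downward, no mirroring between camera and user) are assumptions of this sketch.

```python
import numpy as np

def displacement_feature(c_t, c_prev):
    """Unit vector from the previous centroid to the current one (zeros if no motion)."""
    step = np.asarray(c_t, dtype=float) - np.asarray(c_prev, dtype=float)
    length = np.linalg.norm(step)
    return step / length if length > 0 else np.zeros_like(step)

def resolve_translation(pattern, centroid, reference):
    """Turn "horizontal"/"vertical" into a direction using the reference centroid."""
    if pattern == "horizontal":
        return "right" if centroid[0] > reference[0] else "left"
    if pattern == "vertical":
        # Assumption: image rows grow downward, so a smaller y means "up".
        return "up" if centroid[1] < reference[1] else "down"
    return None  # unknown patterns go to the next stage
```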

For the detection of circular movements, the angle
between vector $C_tC_{t-1}$ and vector $C_{t-1}C_{t-2}$ is computed as the
feature vector 406. This feature can distinguish between
clockwise and counterclockwise circular movements. As
expected, users can draw circles from any position. In
particular, a spiral would be classified as one of the
circular movements instead of a translation. Referring to Fig.
4, the method can use a voting method 407 to check past
results to form meaningful output; the system thereby decreases the
possibility of false classification. The method determines
whether a given gesture is a valid gesture command 408. A
valid gesture needs to be performed continually in some time
interval to initialize the command.
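
The angle feature for circular movements can be sketched as below. Using a signed angle (via `atan2` of the cross and dot products) is an assumption of this sketch, chosen because the sign separates the two rotation directions; which sign corresponds to clockwise depends on whether the y-axis points up or down. The function names are hypothetical.

```python
import math

def turning_angle(c_t, c_t1, c_t2):
    """Signed angle (radians) between successive steps of the trajectory.

    c_t, c_t1, c_t2 are the centroids at times t, t-1 and t-2. Reversing
    both step vectors, as in the vectors C_t C_{t-1} and C_{t-1} C_{t-2},
    leaves this angle unchanged.
    """
    ax, ay = c_t1[0] - c_t2[0], c_t1[1] - c_t2[1]  # earlier step
    bx, by = c_t[0] - c_t1[0], c_t[1] - c_t1[1]    # later step
    return math.atan2(ax * by - ay * bx, ax * bx + ay * by)

def angle_features(centroids):
    """Turning angles for a chronological list of centroids."""
    return [
        turning_angle(centroids[i], centroids[i - 1], centroids[i - 2])
        for i in range(2, len(centroids))
    ]
```

For centroids sampled at equal steps around a circle, every turning angle equals the step angle with a constant sign, whereas a straight translation yields angles near zero.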

Figs. 5 and 6 show some examples of our experimental
results. In each image, the black region, e.g., 501, is viewed
as background. The bounding box, e.g., 502 (highlighted in
white in Fig. 5 for clarity), of each image indicates the
largest skin area as determined by thresholded likelihood,
Equation (2). Note that bounding boxes are only used for
display. The arrow(s), e.g., 503, on each bounding box show
the classification result. A bounding box with no arrow on it, for
example, as in Figs. 5a-c, means that the gesture is an
unknown pattern, or that no movement has occurred, or
insufficient data has been collected. Because we classify
motion patterns along windows in time, there may be some delay
after a gesture is initialized (data is not sufficient for the
system to make a global decision).

According to an embodiment of the present invention, 
unintentional movements can be checked using a voting method 
407 to check past results to form meaningful outputs, thus, 
decreasing the possibility of false classification. Further, a 
user can change gestures without holding his/her hand 
stationary. For any two gestures that can be distinguished
without a new reference point, for example, turn left and then
up, or a translation followed by a circular movement, the user does not
need to hold the hand stationary in between. In tests, the system
demonstrates reliable and accurate performance.

A need exists for an intuitive gesture interface for 
medical imaging workstations. The present invention proposes a 
real-time system and method that recognizes gestures to drive 
a virtual endoscopy system. The system and method can classify 
the user's gesture as one of eight defined motion patterns: turn
left/right, rotate clockwise/counterclockwise, move up/down,
and move in depth in/out. Detecting composite gesture commands
on a two-dimensional plane needs more modification. In addition,
current work takes advantage of the fact that some translation
patterns are performed along the Z-axis of the camera. With only one
camera, designing a six degree-of-freedom gesture interface
with a more flexible camera position needs more research. The
system and method have been tested in a laboratory setting and
further work is needed to improve the system and to evaluate
it in a clinical setting.

Having described embodiments for a system and method for a
real-time gesture interface for medical workstations, it is
noted that modifications and variations can be made by persons
skilled in the art in light of the above teachings. It is
therefore to be understood that changes may be made in the
particular embodiments of the invention disclosed which are
within the scope and spirit of the invention as defined by the
appended claims. Having thus described the invention with the
details and particularity required by the patent laws, what is
claimed and desired protected by Letters Patent is set forth
in the appended claims.